-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AN-380 Make call cache hashing strategy configurable per filesystem and backend #7683
base: develop
Are you sure you want to change the base?
Conversation
…nto jd_AN-380_hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matches the mental picture I developed during mob times. Nice work!
###### Notes on GCS Hashing Strategy | ||
|
||
For some high-throughput production use cases that run many, many copies of the same task differing by only one input file, | ||
the collision rate of `crc32c` may be unacceptably high. To dramatically reduce the chance of collision at the cost of | ||
reducing the collection of tasks that can be call cached, we recommend `hashing-strategy: ["md5", "identity"]`. This | ||
will use `md5` hashes when they exist, and fall back to the very strict `identity` strategy when they do not. Because | ||
all GCS files created by Cromwell are guaranteed to have `md5`, `identity` comes into play only for user-provided workflow | ||
input files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good explanation, one thing to add could be "why?" md5 does not exist, namely parallel composite upload
@@ -174,23 +176,39 @@ object GcsBatchSizeCommand { | |||
file.objectBlobId.map(GcsBatchSizeCommand(file, _)) | |||
} | |||
|
|||
case class GcsBatchCrc32Command(override val file: GcsPath, override val blob: BlobId, setUserProject: Boolean = false) | |||
extends IoHashCommand(file) | |||
case class GcsBatchHashCommand(override val file: GcsPath, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extremely optional bikeshedding.
I've never been the biggest fan of this name, since this command is for a single file... it does not contain a batch.
Doubly so now that GCP Batch competes with GCS batch for headspace.
case class GcsBatchHashCommand(override val file: GcsPath, | |
case class GcsHashCommand(override val file: GcsPath, |
private def getFileHashForGcsPath(gcsPath: GcsPath): IO[FileHash] = delayedIoFromTry { | ||
gcsPath.objectBlobId.map(id => FileHash(HashType.GcsCrc32c, gcsPath.cloudStorage.get(id).getCrc32c)) | ||
} | ||
private def getFileHashForGcsPath(gcsPath: GcsPath, hashStrategy: FileHashStrategy): IO[Option[FileHash]] = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also extremely optional, I wonder if we could apply ✨ More Types ✨ to DRY out this group of functions.
private def getFileHashForGcsPath(gcsPath: GcsPath, hashStrategy: FileHashStrategy): IO[Option[FileHash]] = | |
private def getFileHashForPath[T](path: T, hashStrategy: FileHashStrategy[T]): IO[Option[FileHash]] = |
fsConfigs <- configurationDescriptor.backendConfig.as[Option[Config]]("filesystems").toList | ||
fsKey <- fsConfigs.root.keySet().asScala | ||
configKey = s"${fsKey}.caching.hashing-strategy" | ||
// TODO this allows users to override with an empty list to prevent caching, is that desirable or a footgun? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a global call-caching.enabled
flag that seems like a better choice.
I don't it makes sense to let users toggle call caching on a per-filesystem basis because one backend can use multiple filesystems in the same task (e.g. GCS + DRS).
I would make empty list an exception if call-caching.enabled
is true
.
Description
Code is ready for review, still todo:
Release Notes Confirmation
CHANGELOG.md
CHANGELOG.md
in this PRCHANGELOG.md
because it doesn't impact community usersTerra Release Notes